Skip to content

Conversation

@jan-elastic
Copy link
Contributor

fixes: #110134

@elasticsearchmachine elasticsearchmachine added needs:triage Requires assignment of a team area label v9.0.0 labels Nov 18, 2024
@jan-elastic jan-elastic added >bug :ml Machine learning Team:ML Meta label for the ML team v8.17.0 and removed needs:triage Requires assignment of a team area label labels Nov 18, 2024
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/ml-core (Team:ML)

@elasticsearchmachine
Copy link
Collaborator

Hi @jan-elastic, I've created a changelog YAML for you.

@jan-elastic
Copy link
Contributor Author

jan-elastic commented Nov 18, 2024

Example request / response for the fix:

POST /test-index/_search
{
  "query": {
    "function_score": {
      "random_score": {
      }
    }
  },
  "aggs": {
    "random_sampler": {
      "random_sampler": {
        "probability": 0.5
      },
      "aggs": {
        "samples": {
          "top_hits": {
            "size": 2,
            "_source": [
              "text"
            ]
          }
        }
      }
    }
  }
}
{
  "took": 3,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 5,
      "relation": "eq"
    },
    "max_score": 0.96045524,
    "hits": [
      {
        "_index": "test-index",
        "_id": "ZjX1G5MBaC3tcnySK34-",
        "_score": 0.96045524,
        "_source": {
          "text": "text1"
        }
      },
      {
        "_index": "test-index",
        "_id": "ZTXrG5MBaC3tcnySAH4a",
        "_score": 0.82856405,
        "_source": {
          "text": [
            "text1",
            "text2"
          ]
        }
      },
      {
        "_index": "test-index",
        "_id": "aDX1G5MBaC3tcnySK34-",
        "_score": 0.33547562,
        "_source": {
          "text": [
            "text3",
            "text4"
          ]
        }
      },
      {
        "_index": "test-index",
        "_id": "ZDXrG5MBaC3tcnySAH4Z",
        "_score": 0.14092654,
        "_source": {
          "text": "text1"
        }
      },
      {
        "_index": "test-index",
        "_id": "ZzX1G5MBaC3tcnySK34-",
        "_score": 0.031006515,
        "_source": {
          "text": [
            "text1",
            "text2"
          ]
        }
      }
    ]
  },
  "aggregations": {
    "random_sampler": {
      "seed": -820777490,
      "probability": 0.5,
      "doc_count": 2,
      "samples": {
        "hits": {
          "total": {
            "value": 2,
            "relation": "eq"
          },
          "max_score": 0.96045524,
          "hits": [
            {
              "_index": "test-index",
              "_id": "ZjX1G5MBaC3tcnySK34-",
              "_score": 0.96045524,
              "_source": {
                "text": "text1"
              }
            },
            {
              "_index": "test-index",
              "_id": "ZDXrG5MBaC3tcnySAH4Z",
              "_score": 0.14092654,
              "_source": {
                "text": "text1"
              }
            }
          ]
        }
      }
    }
  }
}

Without the `random_score` you get similar results, all with scores of `1.0`.

@benwtrent
Copy link
Member

@jan-elastic I wonder how it would be with multiple layers of nesting? The original issue that @dgieselaar ran into I think had multiple layers of aggregations (random_sampler -> categorize_text -> top hits)

@jan-elastic
Copy link
Contributor Author

Regarding multiple layers of nesting: look like it works fine @benwtrent

{
  "query": {
    "function_score": {
      "random_score": {
      }
    }
  },
  "aggs": {
    "random_sampler": {
      "random_sampler": {
        "probability": 0.5
      },
      "aggs": {
        "message": {
          "categorize_text": {
            "size": 6,
            "field": "text"
          },
          "aggs": {
            "samples": {
              "top_hits": {
                "size": 1,
                "_source": [
                  "text"
                ]
              }
            }
          }
        }
      }
    }
  }
}
{
  "took": 14,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 5,
      "relation": "eq"
    },
    "max_score": 0.8408959,
    "hits": [
      {
        "_index": "test-index",
        "_id": "ZjX1G5MBaC3tcnySK34-",
        "_score": 0.8408959,
        "_source": {
          "text": "text1"
        }
      },
      {
        "_index": "test-index",
        "_id": "ZzX1G5MBaC3tcnySK34-",
        "_score": 0.72381663,
        "_source": {
          "text": [
            "text1",
            "text2"
          ]
        }
      },
      {
        "_index": "test-index",
        "_id": "ZDXrG5MBaC3tcnySAH4Z",
        "_score": 0.50730485,
        "_source": {
          "text": "text1"
        }
      },
      {
        "_index": "test-index",
        "_id": "ZTXrG5MBaC3tcnySAH4a",
        "_score": 0.24077094,
        "_source": {
          "text": [
            "text1",
            "text2"
          ]
        }
      },
      {
        "_index": "test-index",
        "_id": "aDX1G5MBaC3tcnySK34-",
        "_score": 0.17789865,
        "_source": {
          "text": [
            "text3",
            "text4"
          ]
        }
      }
    ]
  },
  "aggregations": {
    "random_sampler": {
      "seed": -685459259,
      "probability": 0.5,
      "doc_count": 2,
      "message": {
        "buckets": [
          {
            "doc_count": 4,
            "key": "text1",
            "regex": ".*?text1.*?",
            "max_matching_length": 5,
            "samples": {
              "hits": {
                "total": {
                  "value": 2,
                  "relation": "eq"
                },
                "max_score": 0.72381663,
                "hits": [
                  {
                    "_index": "test-index",
                    "_id": "ZzX1G5MBaC3tcnySK34-",
                    "_score": 0.72381663,
                    "_source": {
                      "text": [
                        "text1",
                        "text2"
                      ]
                    }
                  }
                ]
              }
            }
          },
          {
            "doc_count": 4,
            "key": "text2",
            "regex": ".*?text2.*?",
            "max_matching_length": 5,
            "samples": {
              "hits": {
                "total": {
                  "value": 2,
                  "relation": "eq"
                },
                "max_score": 0.72381663,
                "hits": [
                  {
                    "_index": "test-index",
                    "_id": "ZzX1G5MBaC3tcnySK34-",
                    "_score": 0.72381663,
                    "_source": {
                      "text": [
                        "text1",
                        "text2"
                      ]
                    }
                  }
                ]
              }
            }
          }
        ]
      }
    }
  }
}

@jan-elastic jan-elastic force-pushed the propagate-scoring-random-sampler branch from b27e007 to cfe449a Compare November 19, 2024 13:20
Copy link
Member

@benwtrent benwtrent left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think for initial work, this is in a good place. Let's add some yaml testing and such.

@jan-elastic jan-elastic force-pushed the propagate-scoring-random-sampler branch from cfe449a to b09af6a Compare November 20, 2024 07:45
@benwtrent benwtrent added auto-backport Automatically create backport pull requests when merged v8.16.1 labels Nov 20, 2024
@elasticsearchmachine
Copy link
Collaborator

Hi @jan-elastic, I've updated the changelog YAML for you.

@jan-elastic jan-elastic force-pushed the propagate-scoring-random-sampler branch from 490fb1e to bbdc84e Compare November 20, 2024 13:23
Copy link
Member

@benwtrent benwtrent left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:shipit:

@jan-elastic jan-elastic merged commit dea1e7d into main Nov 20, 2024
17 checks passed
@jan-elastic jan-elastic deleted the propagate-scoring-random-sampler branch November 20, 2024 14:34
@elasticsearchmachine
Copy link
Collaborator

💔 Backport failed

Status Branch Result
8.16 Commit could not be cherrypicked due to conflicts
8.x

You can use sqren/backport to manually backport by running backport --upstream elastic/elasticsearch --pr 116957

jan-elastic added a commit that referenced this pull request Nov 20, 2024
* Propagate scoring function through random sampler.

* Update docs/changelog/116957.yaml

* Correct score mode in random sampler weight

* Fix random sampling with scores and p=1.0

* Unit test with scores

* YAML test

* Add capability
elasticsearchmachine pushed a commit that referenced this pull request Nov 20, 2024
* Propagate scoring function through random sampler.

* Update docs/changelog/116957.yaml

* Correct score mode in random sampler weight

* Fix random sampling with scores and p=1.0

* Unit test with scores

* YAML test

* Add capability
elasticsearchmachine pushed a commit that referenced this pull request Nov 20, 2024
* Propagate scoring function through random sampler.

* Update docs/changelog/116957.yaml

* Correct score mode in random sampler weight

* Fix random sampling with scores and p=1.0

* Unit test with scores

* YAML test

* Add capability
rjernst pushed a commit to rjernst/elasticsearch that referenced this pull request Nov 20, 2024
* Propagate scoring function through random sampler.

* Update docs/changelog/116957.yaml

* Correct score mode in random sampler weight

* Fix random sampling with scores and p=1.0

* Unit test with scores

* YAML test

* Add capability
alexey-ivanov-es pushed a commit to alexey-ivanov-es/elasticsearch that referenced this pull request Nov 28, 2024
* Propagate scoring function through random sampler.

* Update docs/changelog/116957.yaml

* Correct score mode in random sampler weight

* Fix random sampling with scores and p=1.0

* Unit test with scores

* YAML test

* Add capability
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

auto-backport Automatically create backport pull requests when merged backport pending >bug :ml Machine learning Team:ML Meta label for the ML team v8.16.1 v8.17.0 v9.0.0

Projects

None yet

Development

Successfully merging this pull request may close these issues.

top_hits fails with Null Pointer Exception when child of random_sampler agg

4 participants